Executive summary

In this report, firstly, all data sets were cleaned and then a comprehensive summary with statistics was presented for each of them. Then, interesting relationships between World Development Indicators were explored. Correlations between most of the indicators were presented for general overview. Subsequently, more specific correlations between GDP per capita were shown, with visualizations. For example, association between GDP per capita and life expectancy was inspected. Afterwards, correlations of World Development Indicators with gold prices were analyzed, as preparation for creating a gold price prediction model. Lastly, using gathered knowledge, a regression model was created, which predicts gold prices with sufficient accuracy.

Libraries used

Following libraries and dependencies were used in the report:

library(knitr)

library(kableExtra)
library(DT)

library(tidyverse)
library(readxl)
library(janitor)

library(lubridate)

library(skimr)

library(corrr)

library(ggplot2)
library(ggthemes)
library(gganimate)
library(ggcorrplot)
library(plotly)

library(caret)
library(missForest)

library(countrycode)

Loading data

The input data includes information on the economic indicators and development of each country as measured by over 200 statistics. In addition, it includes information on currency exchange rates, gold prices, Bitcoin trading, and the monthly performance of the S&P Composite.

The data was collected by various institutions, notably the World Bank and Nasdaq.

# https://data.worldbank.org/
wdi_df <- read_excel('./data/World_Development_Indicators.xlsx', 
                     na = c('', '..')) %>% head(-5)

currency_df <- read_csv('./data/CurrencyExchangeRates.csv', 
                        col_types = cols(Date = col_date("%Y-%m-%d")))

# https://data.nasdaq.com/data/LBMA/GOLD-gold-price-london-fixing
gold_prices_df <- read_csv('./data/Gold prices.csv')

# https://data.nasdaq.com/data/YALE/SPCOMP-sp-composite
sp_df <- read_csv('./data/S&P Composite.csv')

# https://data.nasdaq.com/data/BCHAIN-blockchain
btc_diff_df <- read_csv('./data/bitcoin/BCHAIN-DIFF.csv')
btc_hrate_df <- read_csv('./data/bitcoin/BCHAIN-HRATE.csv')
btc_mkpru_df <- read_csv('./data/bitcoin/BCHAIN-MKPRU.csv')
btc_trvou_df <- read_csv('./data/bitcoin/BCHAIN-TRVOU.csv')

Cleaning data sets

While cleaning each data set, decisions described below were made.

World Development Indicators

  1. As part of loading data, empty and .. values were interpreted as NA values. Last 5 rows were also dropped, because they contained data set metadata, such as last update date.
  2. Country Code and Series Code columns were dropped, as they were not useful in further analysis
  3. All columns except Country Name and Series Name were pivoted into longer format; names of columns were stored into Year column
  4. All columns were pivoted into wider format, taking column names from Series Name column
  5. Country Name column was mutated into column of factor type and Year column was mutated into column of numeric type
  6. Column names were cleaned - all words were transformed into snake_case
wdi_cleaned_df <- wdi_df %>%
  select(-c(`Country Code`, `Series Code`)) %>%
  pivot_longer(-c(`Country Name`, `Series Name`), names_to = 'Year') %>%
  pivot_wider(names_from = `Series Name`) %>%
  mutate(`Country Name` = as_factor(`Country Name`),
         Year = as.numeric(word(Year))) 
wdi_names <- names(wdi_cleaned_df) %>% 
  set_names(make_clean_names(.)) %>%
  as.list
wdi_cleaned_df <- set_names(wdi_cleaned_df, names(wdi_names))

Currency Exchange Rates

  1. As part of loading data, Date column was parsed as column of date type
  2. All columns except Date were pivoted into longer format; names of columns were stored into currency column, values were stored into exchange_rate column
  3. Rows with NA values were dropped
  4. Column names were cleaned - all words were transformed into snake_case
  5. currency column was mutated into column of factor type
currency_df <- currency_df %>%
  pivot_longer(cols = -Date, names_to = 'currency',
               values_to = 'exchange_rate') %>%
  drop_na() %>%
  clean_names() %>%
  mutate(currency = as_factor(currency))

Gold prices

  1. Column names were cleaned - all words were transformed into snake_case
gold_prices_df <- clean_names(gold_prices_df)

S&P Composite

  1. Column names were cleaned - all words were transformed into snake_case
  2. year column was renamed to date
sp_df <- clean_names(sp_df) %>% rename(date = year)

Bitcoin

  1. Column names were cleaned - all words were transformed into snake_case
  2. All data frames loaded from CSV files were joined into one data frame
  3. Following columns were renamed:
    • value_diff to difficulty
    • value_hrate to hash_rate
    • value_mkpru to market_price_usd
    • value_trvou to exchange_trade_volume_usd
btc_diff_df <- clean_names(btc_diff_df)
btc_hrate_df <- clean_names(btc_hrate_df)
btc_mkpru_df <- clean_names(btc_mkpru_df)
btc_trvou_df <- clean_names(btc_trvou_df)

btc_df <- btc_diff_df %>%
  full_join(btc_hrate_df, by = 'date', suffix = c('_diff', '_hrate')) %>%
  full_join(btc_mkpru_df, by = 'date') %>% 
  full_join(btc_trvou_df, by = 'date', suffix = c('_mkpru', '_trvou')) %>%
  rename(difficulty = value_diff,
         hash_rate = value_hrate,
         market_price_usd = value_mkpru,
         exchange_trade_volume_usd = value_trvou)

Cleaned data set summary

In this section, summary with basic statistics for each of the data sets and their variables was presented.

tribble(
  ~Table, ~`No. of columns (attributes)`, ~`No. of rows (observations)`,
  'World Development Indicators', ncol(wdi_cleaned_df), nrow(wdi_cleaned_df),
  'Currency Exchange Rates', ncol(currency_df), nrow(currency_df),
  'Gold prices', ncol(gold_prices_df), nrow(gold_prices_df),
  'S&P Composite', ncol(sp_df), nrow(sp_df),
  'Bitcoin', ncol(btc_df), nrow(btc_df)
) %>% kbl() %>% kable_styling()
Table No. of columns (attributes) No. of rows (observations)
World Development Indicators 215 10608
Currency Exchange Rates 3 243689
Gold prices 7 13585
S&P Composite 10 1810
Bitcoin 5 4661

World Development Indicators

report_table <- function(df, round_digits=2) {
  datatable(df, style = 'bootstrap4', rownames = FALSE,
            options = list(scrollX = TRUE)) %>%
    formatRound(names(select_if(df, is.numeric)), round_digits)
}

wdi_skim_df <- wdi_cleaned_df %>% 
  set_names(wdi_names) %>%
  skim() 
wdi_skim_df %>%
  yank('factor') %>% 
  clean_names(case = 'title') %>%
  report_table
wdi_skim_df %>%
  yank('numeric') %>% 
  clean_names(case = 'title') %>%
  report_table()

Currency Exchange Rates

currency_df %>% 
  clean_names(case = 'title') %>%
  skim() %>% 
  clean_names(case = 'title') %>%
  report_table()

A closer look at the data in Currency Exchange Rates data set reveals ambiguity whether currency exchange rates are from given currency to U.S. Dollar or vice versa. By looking at exchange rate for Polish Zloty, we can deduce that exchange rates in the data set are from U.S. Dollar to Polish Zloty. On the other side, looking at exchange rates of Euro or U.K. Pound Sterling suggests that the conversion in these cases is reverse; it is common knowledge that Euro and U.K. Pound Sterling are more valuable currencies than U.S. Dollar, so their exchange rate should be below 1.0, but it is above 1.0. Therefore, whole data set was discarded from further analysis.

currency_df %>% 
  filter(currency %in% c('Polish Zloty', 'U.K. Pound Sterling', 'Euro',
                         'U.S. Dollar')) %>%
  group_by(date) %>% 
  filter(all(c('Polish Zloty', 'U.K. Pound Sterling', 'Euro', 'U.S. Dollar')
               %in% currency)) %>% 
  tail(4) %>% 
  clean_names(case = 'title') %>% 
  kbl() %>% kable_styling()
Date Currency Exchange Rate
2018-04-30 Euro 1.2079
2018-04-30 Polish Zloty 3.4868
2018-04-30 U.K. Pound Sterling 1.3725
2018-04-30 U.S. Dollar 1.0000

Gold prices

gold_prices_df %>% 
  clean_names(case = 'title', abbreviations = c('USD', 'GBP', 'AM', 'PM')) %>%
  skim() %>%
  clean_names(case = 'title') %>%
  report_table()

S&P Composite

sp_df %>% 
  clean_names(case = 'title', abbreviations = c('CPI')) %>% 
  rename(`S&P Composite` = `S p Composite`) %>%
  skim() %>% 
  clean_names(case = 'title') %>%
  report_table()

Bitcoin

btc_df %>% 
  clean_names(case = 'title', abbreviations = c('USD')) %>%
  skim() %>%
  clean_names(case = 'title') %>%
  report_table()

Finding correlations

Correlations between World Development Indicators were presented on the interactive correlation heat map below. Only variables with more than 50% complete rate (less than 50% of NA values) were selected. Correlation was calculated using the pairwise.complete.obs option which ensures that only complete pairs of observations of variables were used.

Hover over cells in the heat map to show correlated variables names and their correlation values.

wdi_na_filtered_df <- wdi_cleaned_df %>%
  set_names(wdi_names) %>%
  select(where(~mean(is.na(.)) < 0.50))

wdi_cor_matrix <- wdi_na_filtered_df %>%
  select(3:last_col()) %>%
  cor(use = 'pairwise.complete.obs')
wdi_cor_matrix[!lower.tri(wdi_cor_matrix)] <- NA

corr_plot <- wdi_cor_matrix %>% 
  ggcorrplot() +
  labs(x = 'Variable 1', y = 'Variable 2') + 
  theme_classic() +
  theme(axis.text = element_blank(), axis.ticks = element_blank()) 

ggplotly(corr_plot)

Below the heat map, table with all correlations in range (0.6, 0.9) was also included. Pair of variables with correlation higher than 0.9 were excluded, because they showed obvious relationships between variables, like percentage of urban and rural population or percentage of male and female population.

wdi_cor_matrix %>%
  as_cordf() %>%
  stretch() %>% 
  filter(abs(r) > 0.6, abs(r) < 0.9) %>%
  arrange(desc(abs(r))) %>%
  rename(X = x, Y = y, Correlation = r) %>%
  report_table(round_digits = 5)

GDP per capita correlations

Per capita gross domestic product (GDP) measures a country’s economic output per person and is calculated by dividing the GDP of a country by its population. Per capita GDP is a global measure used for gauging the prosperity of nations.

wdi_map_df <- wdi_cleaned_df %>% 
  filter(!country_name %in% c('Channel Islands', 'Kosovo', 
                              'Low & middle income', 'Low income', 
                              'Lower middle income', 'Middle income', 'World',
                              'Upper middle income', 'High income')) 

wdi_map_df %>%
  rename(`GDP per capita (USD)` = gdp_per_capita_current_us) %>%
  plot_geo(locationmode = 'country names') %>%
  add_trace(z = ~`GDP per capita (USD)`, zmin = 0, zmax = 70000, 
            color = ~`GDP per capita (USD)`,
            frame = ~year, 
            colors = 'Blues',
            text = ~country_name,
            locations = ~country_name,
            marker = list(line = list(color = toRGB("grey"), width = 0.5))) %>%
  colorbar(title = 'GDP per capita (USD)') %>% 
  layout(title = 'GDP per capita over years',
         geo = list(scope = 'world',
                    projection = list(type = 'natural earth'),
                    countrycolor = toRGB('grey'),
                    showcoastlines = TRUE)) %>%  
  animation_opts(redraw = FALSE) %>%  
  plotly_build()

Correlation between GDP per capita and life expectancy

Preston curve is an empirical relationship between life expectancy and GDP per capita. It indicates that individuals born in richer countries, on average, can expect to live longer than those born in poor countries. The first plot is a reproduction of the Preston curve for year 2019 and countries with GDP per capita below 70000 USD.

wdi_continents_df <- wdi_map_df %>%
  add_column(continent = countrycode(sourcevar = .$country_name,
                                     origin = "country.name",
                                     destination = "continent")) %>%
  relocate(continent)

preston_curve_plot <- wdi_continents_df %>% 
  filter(year == 2019, gdp_per_capita_current_us < 70000) %>%
  ggplot(aes(x = gdp_per_capita_current_us, y = life_expectancy_at_birth_total_years)) +
  geom_point(aes(color = continent, 
                text = paste('Country:', country_name,
                             '<br>Continent:', continent,
                             '<br>GDP per capita (USD):', round(gdp_per_capita_current_us),
                             '<br>Life expectancy (years):', round(life_expectancy_at_birth_total_years, digits = 2)))) +
  labs(x = 'GDP per capita (USD)',
       y = 'Life expectancy (years)', 
       color = "Continent") + 
  geom_smooth(se = FALSE)

ggplotly(preston_curve_plot, tooltip = 'text')

The second plot presents how GDP per capita and life expectancy changed over years in countries grouped by continents. It also shows that in general, countries with higher GDP tend to have a higher life expectancy. It shows that with passing years standard of living is improving across the world as well.

wdi_continents_df %>%
  ggplot(aes(gdp_per_capita_current_us, 
             life_expectancy_at_birth_total_years, 
             size = population_total, 
             color = continent)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(name = "Population",
                        breaks = c(1e9, 1e8, 1e7, 1e6),
                        limits = c(1e6, 1e9),
                        range = c(1, 8)) +
  scale_x_log10(labels = scales::comma) +
  theme_bw(base_size = 8) +
  labs(title = "Year: {round(frame_time)}", 
       x = 'GDP per capita (USD)',
       y = 'Life expectancy (years)', 
       color = "Continent",
       caption = "") +
  transition_time(year) +
  ease_aes('linear') 

High positive correlation can be immediately deduced by observing the animation, which is also confirmed by the following map plot.

wdi_life_expectancy_cor_df <- wdi_map_df %>% 
  group_by(country_name) %>%
  summarize(correlation = cor(gdp_per_capita_current_us,
                              life_expectancy_at_birth_total_years, 
                              use = 'pairwise.complete.obs'))

wdi_life_expectancy_cor_df %>%
  plot_geo(locationmode = 'country names') %>%
  add_trace(z = ~correlation, zmin = -1, zmax = 1, 
            color = ~correlation,
            colors = c("blue", "white", "red"),
            text = ~country_name,
            locations = ~country_name,
            marker = list(line = list(color = toRGB("grey"), width = 0.5))) %>%
  colorbar(title = 'Correlation') %>% 
  layout(title = 'GDP per capita (USD) - life expectancy (years) correlation',
         geo = list(scope = 'world',
                    projection = list(type = 'natural earth'),
                    countrycolor = toRGB('grey'),
                    showframe = TRUE,
                    showcoastlines = TRUE)) %>%  
  plotly_build()

Standard of living is the material well being of the average person in a given population. Life expectancy increases with the standard of living. GDP per capita is a good approximation for standard of living, which explains why there is a high correlation between GDP per capita and life expectancy.

Correlation between GDP per capita and infant mortality rate

To validate that decrease in per capita GDP is associated with an increase in infant mortality rate, as found by the authors of Aggregate Income Shocks and Infant Mortality in the Developing World paper, the variables change over years 1999-2015 was presented in selected countries.

infmort_plot <- wdi_continents_df %>%
  filter(country_name %in% c("United States", "Tonga", "Colombia", "Grenada", 
                             "Sri Lanka", "Malta", "Germany", "Japan", "Sweden", 
                             "Netherlands"),
         year %in% 1999:2015) %>%
  rename(`GDP per capita (USD)` = gdp_per_capita_current_us,
         `Infant mortality rate (per 1,000 live births)` =
           mortality_rate_infant_per_1_000_live_births,
         Continent = continent,
         Country = country_name) %>%
  ggplot(aes(x = `GDP per capita (USD)`, 
             y = `Infant mortality rate (per 1,000 live births)`,
             color = Continent,
             tooltip = Country)) + 
  geom_point() +
  facet_wrap(~year)

ggplotly(infmort_plot)

Once again, high correlation can be deduced from the plots, but this time it is negative, which is confirmed by the following map plot.

wdi_mortality_rate_cor_df <- wdi_map_df %>% 
  group_by(country_name) %>%
  summarize(correlation = cor(gdp_per_capita_current_us,
                              mortality_rate_infant_per_1_000_live_births, 
                              use = 'pairwise.complete.obs'))

wdi_mortality_rate_cor_df %>%
  plot_geo(locationmode = 'country names') %>%
  add_trace(z = ~correlation, zmin = -1, zmax = 1, 
            color = ~correlation,
            colors = c("blue", "white", "red"),
            text = ~country_name,
            locations = ~country_name,
            marker = list(line = list(color = toRGB("grey"), width = 0.5))) %>%
  colorbar(title = 'Correlation') %>% 
  layout(title = 'GDP per capita (USD) - mortality rate, infant (per 1,000 live births) correlation',
         geo = list(scope = 'world',
                    projection = list(type = 'natural earth'),
                    countrycolor = toRGB('grey'),
                    showframe = TRUE,
                    showcoastlines = TRUE)) %>%  
  plotly_build()

Correlation between gold prices and World Development Indicators

To explore correlation between gold prices and global development indicators (for example whole world’s GDP), several operations on data sets were made:

  1. Gold prices data set was grouped by year and stored into yearly_gold_prices_df data frame
    • gold_price_usd column was added, which is a mean of gold USD prices in the morning and in the afternoon
    • Rows with NA values of gold_price_usd column were dropped
    • Whole data set was grouped by year
    • Gold price was aggregated, calculating mean gold price per year
  2. S&P Composite data set was grouped by year and stored into yearly_sp_df data frame
    • Whole data set was grouped by year
    • S&P index was aggregated, calculating mean S&P index per year
  3. Resulting data sets from operations 2. and 3. were merged with data set of world’s development indicators and stored into world_joined_df data frame
    • Subset of rows with country_name equal to "World" was selected from data set which only includes columns with less than 50% of NA values
    • Inner join by year column with yearly_gold_prices_df data set from operation 1. was performed. This way only observations for years with known gold price were preserved
    • Left join by year column with yearly_sp_df data set from operation 2. was performed. This way no observations with known gold price were discarded

Bitcoin prices data set was discarded from further analysis, because it didn’t contain data before year 2009 - Bitcoin didn’t exist before that date. Including it into resulting joined data set would result in more than 50% observations with NA values for Bitcoin price variable.

yearly_gold_prices_df <- gold_prices_df %>% 
  mutate(gold_price_usd = (usd_am + usd_pm) / 2) %>%
  drop_na(gold_price_usd) %>%
  group_by(year = year(date)) %>% 
  summarize(gold_price_usd = mean(gold_price_usd))

yearly_sp_df <- sp_df %>% 
  group_by(year = year(date)) %>%
  summarize(sp_composite = mean(s_p_composite))

world_joined_df <- wdi_na_filtered_df %>%
  filter(`Country Name` == 'World') %>% 
  inner_join(yearly_gold_prices_df, by = c('Year' = 'year')) %>%
  left_join(yearly_sp_df, by = c('Year' = 'year')) %>%
  rename(`Gold price (USD)` = gold_price_usd,
         `S&P Composite` = sp_composite)

In the table below, all absolute correlations with gold price higher than 0.6 were presented.

world_cor <- world_joined_df %>%
  select(3:last_col()) %>%
  cor(use = 'pairwise.complete.obs') %>%
  remove_empty()

world_gold_cor <- world_cor %>% 
  as_cordf() %>%
  focus(`Gold price (USD)`) %>% 
  arrange(desc(abs(`Gold price (USD)`))) %>% 
  filter(abs(`Gold price (USD)`) > 0.6) %>%
  rename(`Gold price (USD) correlation` = `Gold price (USD)`,
         Variable = term)

report_table(world_gold_cor, round_digits = 5)

Gold price regressor

Reducing pair-wise correlations

Since in the table of gold correlations with other variables there is more than 50 variables with absolute correlation higher than 0.6, caret::findCorrelation function with cutoff parameter set to 0.9 was used to reduce pair-wise correlations.

vars_to_be_removed <- world_cor %>% findCorrelation()

world_gold_cor_subset <- world_joined_df %>% 
  select(3:last_col()) %>%
  select(!all_of(vars_to_be_removed)) %>%
  correlate() %>%
  focus(`Gold price (USD)`) %>%
  arrange(desc(abs(`Gold price (USD)`))) %>% 
  filter(abs(`Gold price (USD)`) > 0.6) %>%
  rename(`Gold price (USD) correlation` = `Gold price (USD)`,
         Variable = term)

correlated_variables <- world_gold_cor_subset %>% pull(Variable)

Most significant correlations were presented on the graph below.

world_gold_cor_plot <- world_gold_cor_subset %>% 
  ggplot(aes(x = reorder(Variable, `Gold price (USD) correlation`), 
             y = `Gold price (USD) correlation`,
             text = paste('Variable:', Variable,
                          '<br>Gold price (USD) correlation:', 
                          round(`Gold price (USD) correlation`, digits = 5)))) + 
  geom_col() + 
  ylim(-1, 1) +
  labs(x = '') + 
  coord_flip() + 
  theme_minimal()

ggplotly(world_gold_cor_plot, tooltip = 'text')

Handling missing values

Before training the model, there is a need to handle NA values in the data set in some way.

Missing values were imputed by using missForest library, which is based on Random Forest algorithm. missForest is one of the most accurate imputation methods, but at the cost of slow processing time. In this case, slow processing time is not a problem, because input data set is relatively small.

miss_forest <- world_joined_df %>%
  select(3:last_col()) %>%
  as.data.frame() %>%
  missForest(verbose = TRUE) 
##   removed variable(s) 10 11 12 30 31 38 40 49 50 51 53 68 70 96 due to the missingness of all entries
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.04191611 
##     difference(s): 0.007450749 
##     time: 1.934 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.02928387 
##     difference(s): 0.0008458608 
##     time: 1.913 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.02754529 
##     difference(s): 4.07246e-05 
##     time: 1.851 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.03082877 
##     difference(s): 1.441841e-05 
##     time: 1.878 seconds
## 
##   missForest iteration 5 in progress...done!
##     estimated error(s): 0.03182396 
##     difference(s): 4.347241e-06 
##     time: 1.875 seconds
## 
##   missForest iteration 6 in progress...done!
##     estimated error(s): 0.02896512 
##     difference(s): 1.279036e-05 
##     time: 1.884 seconds
print(miss_forest$OOBerror)
##      NRMSE 
## 0.03182396

Training the model

Data set was split into training and testing partitions in 75/25 ratio. A validation set was created using repeated cross-validation with number of divisions equal to 2 and number of repetitions equal to 5.

imputed_df <- miss_forest$ximp %>%
  select(all_of(correlated_variables)) %>%
  cbind(Year = world_joined_df$Year,
        `Gold price (USD)` = world_joined_df$`Gold price (USD)`)

train_index <- createDataPartition(imputed_df$`Gold price (USD)`, 
                                   p = 0.75, 
                                   list = FALSE)
train_data <- imputed_df[train_index,]
test_data <- imputed_df[-train_index,]

ctrl <- trainControl(method = "repeatedcv",
                     number = 2,
                     repeats = 5)

The following graph shows the similarity of the distributions of the training and test data. Since input data consists only of 51 observations, there is some discrepancy between training and testing data partitions.

ggplot() +
  geom_density(aes(`Gold price (USD)`, fill = "Train"), train_data, alpha = 0.6) +
  geom_density(aes(`Gold price (USD)`, fill = "Test"), test_data, alpha = 0.6) +
  labs(x = "Gold price (USD)", y = "Density", fill = "Partition data set")

Random Forest regression model was used to predict the gold prices.

fit <- train(`Gold price (USD)` ~ . - Year,
             data = train_data,
             method = "rf",
             trControl = ctrl)

fit
## Random Forest 
## 
## 39 samples
## 26 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 5 times) 
## Summary of sample sizes: 20, 19, 20, 19, 19, 20, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    161.0114  0.9310910  102.3224
##   13    164.0964  0.9276829  104.2045
##   25    165.5569  0.9265937  106.4599
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.

Two measures, R2 and RMSE, were used to assess prediction accuracy.

R2 measures the strength of the relationship between the model and the dependent variable on a 0 – 100% scale.

Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. RMSE is a good measure of how accurately the model predicts the response.

prediction <- predict(fit, test_data)

post_resample <- postResample(pred = prediction,
                              obs = test_data$`Gold price (USD)`)
post_resample
##       RMSE   Rsquared        MAE 
## 58.3185252  0.9866305 51.8092988

R2 value of 0.9866305 and RMSE value of 58.3185252 mean that model is accurate.

The following graph shows the test set values and the model output values.

prediction_comparison_df <- tibble(year = test_data$Year, 
                                   actual = test_data$`Gold price (USD)`,
                                   predicted = prediction)

ggplot(prediction_comparison_df, aes(x = year)) +
  geom_line(aes(y = actual, color = "Test data set")) +
  geom_line(aes(y = predicted, color = "Predicted")) +
  labs(color = "Values", x = "Year", y = "Gold price (USD)")

Variable importance

ggplot(varImp(fit))

By looking at the above plot, it turned out that pure economic indicators, such as GDP, GDP per capita and gross national expenditure were the most important in predicting gold prices.

Following them were environment-related variables, such as CO2 emissions from liquid fuel consumption.

Lastly, another important group of indicators was related to population. The reason for importance of these variables might be the fact that population is growing faster than supply of gold. It can be expected that a significant percentage of population will be willing to purchase jewelry or electronic products, whose manufacture uses gold.